Abstract. We propose a concise approximate description, and a method for efficiently obtaining
this description, via adaptive random sampling of the performance (running time, memory
consumption, or any other profileable numerical quantity) of a given algorithm on some
low-dimensional rectangular grid of inputs. The formal correctness is proven under reasonable 
assumptions on the algorithm under consideration; and the approach's practical benefit is demon- 
strated by predicting for which observer positions and viewing directions an occlusion culling 
algorithm yields a net performance benefit or loss compared to a simple brute force renderer.

1 Introduction

Although formal analysis can give bounds for many aspects of an algorithm's runtime behavior
(such as running time or memory consumption), the input can heavily influence the actual
behavior (e.g. for algorithms for real-time rendering of virtual 3D scenes).
To evaluate algorithms for practical applications in which the input (or the characteristics
of the input) is known, a more detailed estimation of the algorithm's behavior is necessary in
order to

• select the appropriate algorithm for the given setting (hardware, application and input). 

• find suitable parameters to adapt the selected algorithm to the setting. 

• identify bottlenecks and starting points for further improvements of the algorithm. 

If the number of possible inputs is sufficiently small, it may be possible to evaluate the observed 
property of the algorithm (e.g. the running time) for every input and use this as basis for 
the evaluation. But in most cases the input space is too big (e.g. all possible positions inside 
a virtual scene), so that only a few samples can be evaluated experimentally. For a simple 
uniform sampling approach, it is difficult to capture the structures the input may exhibit. 
A common alternative in computer graphics is to manually select a camera path covering all
relevant inputs.

But if small changes in the input mostly lead to only small changes in the behavior of the
algorithm, we can apply our adaptive sampling method. It subdivides the input space into
regions in which the algorithm behaves similarly. This subdivision can then serve as an
easy-to-handle model of the algorithm.

In the following, we 

i) present the method that creates this subdivision of the input. 

ii) prove that, if the function which describes the behavior of the algorithm is
Lipschitz-continuous, it can be approximated by this method.

iii) evaluate the method in the domain of real-time 3d rendering, in which we can make use 
of the distinct local coherence of many rendering algorithms. 

The goal of our approach is to preprocess a given algorithm via blackbox queries at certain
inputs, in order to quickly and approximately predict its behavior on other inputs. The data
structure constructed during preprocessing is a hierarchical subdivision described in Section [2];
it gives a kind of global picture of the algorithm under consideration, with many applications
to Algorithm Engineering (Section [2.1]). The success of our approach is both proven formally
in Section [3] (under reasonable analytical assumptions) and demonstrated empirically on the
Occlusion Culling problem in computer graphics: see Sections [4], [5], [6] and [7].

2 Randomized Adaptive Hierarchical Subdivision 

Consider a d-dimensional cuboid $C = [a_1,b_1] \times [a_2,b_2] \times \cdots \times [a_d,b_d] \subseteq \mathbb{R}^d$.
We want to approximate an unknown function $f : C \to \mathbb{R}$, accessible through blackbox
queries for its values $f(x)$ at given arguments $x$, by a piecewise constant function consisting
of 'few and simple' pieces.



Note that we do not claim Occlusion Culling to behave Lipschitz-continuously, but rather consider ii) and
iii) as two classes of scenarios that benefit from i).

Algorithm 1. Fix dimension $d \in \mathbb{N}$, sample size k, so-called splitting threshold $s > 0$, and
the unknown $f : C \subseteq \mathbb{R}^d \to \mathbb{R}$.

• Sample k arguments $x_1, \ldots, x_k \in C$ independently uniformly at random.

• Query values $y_i := f(x_i)$, $1 \le i \le k$.

• If $|y_i - y_j| \le s$ holds for all $1 \le i, j \le k$,
replace f on C by the constant function $g :\equiv z := \mathrm{avrg}(y_1, \ldots, y_k)$.

• Otherwise cut C into $2^d$ subcuboids of equal size
and recurse to (the restrictions of f to) each of them.
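The four steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `Node`, `subdivide` and `leaves`, the child ordering (bit i of the child index selects the upper half in dimension i) and the `max_depth` safety cap are assumptions introduced here and are not part of Algorithm 1.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


@dataclass
class Node:
    box: Box
    value: Optional[float] = None  # constant z on a leaf, None on inner nodes
    children: List["Node"] = field(default_factory=list)


def subdivide(f: Callable[[Sequence[float]], float], box: Box, k: int, s: float,
              rng: random.Random, max_depth: int = 10) -> Node:
    """Sketch of Algorithm 1 on the cuboid `box`."""
    # Sample k arguments uniformly at random and query the blackbox f.
    xs = [[rng.uniform(lo, hi) for lo, hi in box] for _ in range(k)]
    ys = [f(x) for x in xs]
    # max - min <= s is equivalent to |y_i - y_j| <= s for all pairs i, j.
    # max_depth is an extra safety cap that is NOT part of Algorithm 1.
    if max(ys) - min(ys) <= s or max_depth == 0:
        return Node(box, value=sum(ys) / k)
    # Otherwise cut the box into 2^d equal subcuboids and recurse.
    node = Node(box)
    d = len(box)
    for index in range(2 ** d):
        child_box = []
        for i, (lo, hi) in enumerate(box):
            mid = (lo + hi) / 2
            # Bit i of index selects the lower or upper half in dimension i.
            child_box.append((mid, hi) if (index >> i) & 1 else (lo, mid))
        node.children.append(subdivide(f, child_box, k, s, rng, max_depth - 1))
    return node


def leaves(node: Node) -> List[Node]:
    """Collect the leaves, i.e. the constant pieces of g."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]
```

For example, `subdivide(lambda x: 100.0 * x[0], [(0.0, 1.0), (0.0, 1.0)], k=8, s=5.0, rng=random.Random(0))` builds a quadtree whose cells shrink along the first coordinate until the sampled values inside each cell differ by at most 5.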

The underlying idea is simple: If values $f(x_i)$ and $f(x_j)$ deviate too much, then the cuboid
cannot be accurately described by a constant function on entire C.

Fig. 1. Example piecewise constant approximation g of a function f on $[a_1,b_1] \times [a_2,b_2]$
as output by Algorithm [1]. Shades of gray indicate the values of g.

Note that the recursive process of Algorithm [1] yields a piecewise constant function g, in which
the locally constant parts constitute a hierarchical subdivision known in 2D as quadtree
(indicated in Figure [1]) and in 3D as octree. In particular, this g naturally comes with a very
practical and efficient representation as a data structure: a hierarchical subdivision. That is,
Algorithm [1] can be considered as preprocessing f for the following interpolation:

Algorithm 2. Fix a d-dimensional octree T over C as produced by Algorithm [1].

Given $x \in C$, iteratively

• determine which of the $2^d$ subcuboids C' of C this input x lies in;

• proceed to this C' (i.e. to the corresponding subtree of T)

• unless C is already a leaf of T.

• In the latter case, return the constant value z of g on C.
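The lookup of Algorithm 2 can be sketched as a simple descent through the tree. The sketch below is an illustration under assumptions: the `Node` layout and the convention that bit i of the child index selects the upper half in dimension i are choices made here, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    box: List[Tuple[float, float]]  # (low, high) per dimension; unused on leaves
    value: Optional[float] = None   # constant z of g on a leaf
    children: List["Node"] = field(default_factory=list)


def lookup(root: Node, x: Sequence[float]) -> Optional[float]:
    """Sketch of Algorithm 2: walk down to the leaf containing x, return its value."""
    node = root
    while node.children:  # stop as soon as the current cuboid is a leaf
        index = 0
        for i, (lo, hi) in enumerate(node.box):
            # Points on the cutting plane are assigned to the upper half.
            if x[i] >= (lo + hi) / 2:
                index |= 1 << i
        node = node.children[index]
    return node.value
```

Each query touches one node per tree level and spends time proportional to d per level, so prediction is much cheaper than profiling the algorithm itself whenever the tree is shallow.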

Because of the exponential term $2^d$, this approach can be efficient only for small values of d. (In
Section H] we will apply it for $d = 2\tfrac{1}{2}$: two spatial dimensions and a discrete directional one.)
The piecewise constant function g output by Algorithm [1] (and thus also the extrapolated
values produced by Algorithm [2]) can of course not be expected to approximate the given f
in general. But Section [3] asserts that it does so for Lipschitz-continuous f and sufficiently
large sample sizes.

2.1 Applications of the Hierarchical Subdivision to Algorithm Engineering

Consider some algorithm A operating on (integral or continuous) inputs x from the cuboid
C. Let $f(x)$ denote the value at input x of some profileable quantitative property of A.
This may for instance be running time, memory consumption, number of instructions etc.
(Section U] will for example consider the number of occlusion queries.) Then Algorithm [1]
produces a hierarchical subdivision of C and a piecewise constant approximation g to f from
profiled values $f(x)$ at 'few' adaptively sampled inputs x; and Algorithm [2] uses this data
structure to approximately predict the value $f(x)$ (i.e. the quantitative property of algorithm
A) on other inputs $x \in C$.

Now this approach cannot be expected to succeed all the time and for every A. On the other
hand many practical algorithms (in particular those operating on continuous, e.g. floating
point data) do exhibit some sort of continuity: if not strictly, then in the mollified sense of
local averages. In fact this kind of benignity is analogous to the temporal and spatial locality
hypotheses on which both caching (data) and instruction prefetch / branch prediction (control)
techniques successfully rely. And if Algorithm [1] does yield a good approximation of
the behavior of A, this can be employed in various ways:

Realistic Rating and Comparison of Algorithms: A picture as in Figure [1] summarizes
the behavior of the underlying algorithm A rather nicely. One can easily read off regions
of inputs in which A takes long (dark) or little time (bright). A similar picture for another
algorithm B permits deciding for which regions A may succeed over B and by how much.

Average-Case Performance Estimation for Generic Input Distributions: Worst-case
running times often suffer from few, but practically rare 'bad' inputs. An average case
statement is restricted to one specific input distribution and may be useless for another. The
hierarchical decomposition produced by Algorithm [1] on the other hand allows estimating
average case properties for many distributions on input space. In fact the octree needs to be
determined only once: a distribution then amounts to assigning weights to the sub(sub)cuboids
and the induced average case property to a mere (re-)calculation of their weighted average.
The worst case can be read off as well.
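The re-weighting described above can be sketched as follows. The representation of the subdivision as a flat list of `(box, value)` leaves and the function names are assumptions made for illustration; the default weight is the volume of the box, which corresponds to a uniform input distribution.

```python
from math import prod
from typing import Callable, List, Sequence, Tuple

Box = Sequence[Tuple[float, float]]  # (low, high) per dimension
Leaf = Tuple[Box, float]             # a leaf cuboid with its constant value z


def box_volume(box: Box) -> float:
    """Weight of a leaf under the uniform distribution (up to normalization)."""
    return prod(hi - lo for lo, hi in box)


def average_case(leaves: List[Leaf],
                 weight: Callable[[Box], float] = box_volume) -> float:
    """Weighted average of the leaf values; `weight` encodes the input distribution."""
    total = sum(weight(box) for box, _ in leaves)
    return sum(weight(box) * z for box, z in leaves) / total


def worst_case(leaves: List[Leaf]) -> float:
    """Largest predicted value over all leaves."""
    return max(z for _, z in leaves)
```

Switching to another input distribution only requires passing a different `weight` function; the leaves themselves, and hence the expensive profiling runs, are reused.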

Empirical Algorithm Evaluation on Generic Hardware: Algorithm [1] can be applied
to the same A in order to determine various profileable quantities separately, for instance
to count how often it performs operation $a_1$, $a_2$, etc. (yielding counts $f_1$, $f_2$, etc.). Now
suppose that on some specific hardware H, $a_i$ takes time $t_i$: Then the total time used by A can
be estimated as $\sum_i f_i \cdot t_i$ (for sequential execution; $\max_i f_i \cdot t_i$ for parallel, and
similarly for mixed execution) by mere calculation, i.e. without actually having to execute A on H.
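As a small worked instance of the two formulas, the following sketch combines profiled operation counts with per-operation times of a target machine. The operation names and all numbers in the usage note are hypothetical and only serve to show the arithmetic.

```python
from typing import Dict


def estimate_time(counts: Dict[str, float], unit_times: Dict[str, float],
                  parallel: bool = False) -> float:
    """Combine operation counts f_i with per-operation times t_i.

    Sequential execution sums the products f_i * t_i; fully parallel
    execution is bounded by the largest single product.
    """
    costs = [counts[op] * unit_times[op] for op in counts]
    return max(costs) if parallel else sum(costs)
```

With hypothetical counts `{"draw": 100, "query": 20}` and unit times `{"draw": 0.5, "query": 2.0}`, the sequential estimate is 100 * 0.5 + 20 * 2.0 = 90 time units and the parallel estimate is 50 time units.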

Parameter Optimization: Suppose some algorithm A has d input dimensions and k
parameters. A common topic in algorithm engineering is how to choose these parameters
(depending on the specific inputs and hardware) in order to gain optimum performance. For
example, many tree-based algorithms behave much better when collapsing all 'small' subtrees
of size below a certain threshold into simple arrays. This threshold value is the parameter to
be optimized. Again, this is a problem which the data structure generated by Algorithm [1]
can help to solve. In Figure [1] for instance, y as a parameter would be chosen depending
on input x as indicated by the dashed line, because that yields an overall performance below
the 10% grey level.

Automatic Adaption to Computational Environments combines parameter optimization
with the performance prediction from generic to specific hardware.

2.2 Applications of the Subdivision to the Rendering of Virtual Scenes

In this section, we transfer the proposed applications in Algorithm Engineering to the field of
real-time rendering of 3D scenes. We introduce the problems and the algorithms
we use for the examples and give an overview of the covered applications of the subdivision
method in this domain.

Our field of interest is the occlusion culling problem in computer graphics: Interactive
display of highly complex scenes like landscapes with extensive vegetation, large city models,
or highly detailed datasets in scientific visualization is a major challenge for rendering
algorithms. Although the capabilities of computer graphics hardware have dramatically increased
in the past decades, the handling of scene complexity is still one of the most fundamental
problems in computer graphics. A large number of approaches have been proposed to handle
high scene complexity in interactive applications [MH99, LRC+03]. In this paper we focus on
occlusion culling: trying to avoid rendering parts of the scene which are not
visible to the observer anyway. A multitude of algorithms has been suggested and is being
employed for this purpose [CCSD03, Dur00]. This raises the question of whether, and
under which circumstances, one algorithm may be superior to another.

We apply and evaluate the approach of Section [2.1] as a means to solve this problem.
Specifically we consider two fundamentally different rendering algorithms: the brute-force way
of sending all triangles of the scene to the graphics pipeline and a recent one [BWPP04]. The
latter algorithm uses a feature of modern graphics adapters that allows counting the number of
pixels of an object which pass the depth test of the rendering pipeline and are therefore
not occluded by a previously rendered object. In order to make use of this feature, the virtual
scene is organized in a tree that represents a bounding volume hierarchy, in which the
axis-aligned bounding box of a node encloses the boxes and geometrical objects of all its
children (here we use an octree). The rendering algorithm traverses the nodes of the tree
(ordered front-to-back from the observer's position), and before a node is rendered,
its bounding box is tested for whether it contributes at least one pixel to the frame buffer. If
the bounding box is not fully occluded, all associated objects are rendered and the traversal
continues with the child nodes. If the box is hidden, the corresponding subtree is skipped for
this frame, since all children lie inside this box and are therefore completely occluded as well.
As the visibility test itself needs to pass the rendering pipeline, it takes some time for its
result to become available. To hide this delay, the algorithm continues with the rendering of the
scene and updates the visibility information when the result arrives. This can result in the
futile rendering of some hidden nodes, but is still faster than a blocking wait.
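The traversal just described can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of [BWPP04]: the visibility test is a caller-supplied predicate `box_visible` that answers immediately, so the latency hiding is not modelled, and the names `SceneNode` and `render_with_culling` are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneNode:
    name: str
    distance: float  # distance of the node's bounding box to the observer
    objects: List[str] = field(default_factory=list)
    children: List["SceneNode"] = field(default_factory=list)


def render_with_culling(root: SceneNode,
                        box_visible: Callable[[SceneNode], bool]
                        ) -> Tuple[List[str], int]:
    """Front-to-back traversal that skips subtrees whose bounding box is hidden.

    Returns the rendered objects and the number of visibility tests issued.
    """
    rendered: List[str] = []
    queries = 0
    stack = [root]
    while stack:
        node = stack.pop()
        queries += 1
        if not box_visible(node):
            continue  # the whole subtree lies inside the hidden box
        rendered.extend(node.objects)
        # Push the farthest child first so that the nearest one is popped first.
        stack.extend(sorted(node.children, key=lambda c: c.distance, reverse=True))
    return rendered, queries
```

The returned query count is the kind of profileable quantity that a subdivision can be built on, alongside the running time.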

In the evaluation we apply the presented applications (Section [2.1]) to these rendering
algorithms in the field of computer graphics.

1. We use hierarchical subdivisions to globally compare the algorithms according to their
running time and evaluate the occlusion culling efficiency (in Section [4]).

2. We use the average values of subdivisions according to running time in order to identify
the optimal value for the maximum octree depth in the example setting (in Section [5]).

3. Section [6] gives an overview of how the running time can be predicted with subdivisions
according to the components of a cost function for generic hardware.

4. In Section [7] we present how subdivisions in this area can be extended to include
additional information about the viewing direction in order to select the best rendering
algorithm online, depending on the position and the viewing direction.

2.3 Related Work

There is a vast literature on algorithmic methods for adaptively and hierarchically
approximating an unknown function f piecewise by simpler ones.

i) Numerical treatment of a function f on some smooth manifold M (e.g. the solution to
a partial differential equation using the Finite Element Method, FEM) generally proceeds by
first triangulating M and replacing f on each such triangle by a linear function. For reasons
of accuracy, the mesh thus considered on M is usually desired to be finer on regions of high
curvature (corresponding to large variations of f) and coarse on 'flat' parts of M; cf. e.g.
[PLW05].

ii) Also well-known are various methods of approximately 'learning' an unknown function
f by querying its values $f(x_i)$ on appropriately chosen arguments $x_i$. This is in fact the
essence of information-based complexity [TWW88], e.g. for numerically integrating f. Again,
the evaluation points are preferably chosen adaptively: more densely if, and where, f exhibits
strong variations.

As a synthesis of i) and ii) concerning Lipschitz-continuous functions (cf. Section [3] below), the
work [BDK08] is concerned with 1D integration; and [Coop95, Beli06] focus on deterministic
uniform approximation of f by piecewise linear functions. We point out that i+ii) can also
be considered as lossy function compression problems: replacing a (possibly complicated) f
by some simple g resembling f. 'Simple' here means piecewise constant or linear; for other
classes of simple functions (like, e.g., sines and cosines) one arrives at wavelets and Fourier
compression with famous applications such as mp3 and jpeg.

The main idea of the present work is to apply these methods to a seemingly unrelated but 
notorious problem in Algorithm Engineering in general and especially in computer graphics: 

iii) Predicting the behavior of an algorithm, e.g., runtime. This has been a major topic
particularly in parallel and distributed computing (compare e.g. [FRW96, BMW02, BGLR04])
and in computer graphics [FS93, WiWo03].

iv) Evaluating the efficiency (runtime, frames) of a rendering algorithm. Typically, the
target function is measured along a chosen camera path [FS93] or with an increasing scene
complexity (number of polygons) [CDL+96] or for some fixed chosen viewing points [PZvBG00].

v) Maintaining an adaptive rendering algorithm. This has been applied to real-time
rendering systems to adaptively adjust image quality in order to maintain a uniform,
user-specified target frame rate [FS93].

3 Asymptotic Analysis and Correctness of the Approach 

As already mentioned, Algorithms [1] and [2] cannot be expected to succeed on an arbitrary
unknown function. We now prove that they do approximate such $f : C \to \mathbb{R}$ with high
probability if f is Lipschitz-continuous, provided that the sample size is large enough. More
precisely, Theorem [4] asserts successful approximation up to given absolute error

• uniformly for sample size roughly proportional to the volume of C

• in the least squares-sense for sample size roughly proportional to the diameter of C.

Both are shown asymptotically best possible.

3.1 Reminder and Properties of Lipschitz-continuous Functions

For $c > 0$, a classical notion in calculus calls a function $f : \mathrm{dom}(f) \subseteq \mathbb{R}^d \to \mathbb{R}$ c-Lipschitz if
$|f(x) - f(y)| \le c \cdot \|x - y\|$ holds for all $x, y \in \mathrm{dom}(f)$. For instance any differentiable f with
derivative bound $\|f'(x)\|' \le c$ is c-Lipschitz. Here $(\|\cdot\|, \|\cdot\|')$ denotes a dual pair of norms,
i.e. one satisfying the Hölder Inequality $|\langle x, y \rangle| \le \|x\| \cdot \|y\|'$ for all $x, y \in \mathbb{R}^d$; e.g. Euclidean
norms, or $\|x\| = (\sum_i |x_i|^p)^{1/p}$ and $\|y\|' = (\sum_i |y_i|^q)^{1/q}$ with $1 = 1/p + 1/q$.

We also remark that any continuously differentiable function on C is c-Lipschitz for some (but
possibly very large) c, because its derivative is bounded on the compact set C.

Lemma 3. Let $\mu$ denote a probability measure on $C = [a_1,b_1] \times [a_2,b_2] \times \cdots \times [a_d,b_d] \subseteq \mathbb{R}^d$,
i.e. with $1 = \mu(C) = \int_C 1\,d\mu$. Moreover write $\mathrm{diam}(C) := \sup\{\|x - y\| : x, y \in C\}$ and
$B(x,r) := \{y \in C : \|x - y\| < r\}$. Finally consider a c-Lipschitz function $f : C \to \mathbb{R}$.

a) For $s > 0$ and $y \in C$, it holds: $\{x \in C : |f(x) - f(y)| < s\} \supseteq B(y, s/c)$.

b) Let $A := \min_{y \in C} \mu(B(y, s/c))$ and suppose $\sup_C f - \inf_C f > 4s$. Then $k := 1/A$ points
$x_i$, sampled from C according to the distribution $\mu$, contain with constant probability some
$x_i, x_j$ such that $|f(x_i) - f(x_j)| > s$.

c) For any $y \in C$ and $s > 0$ and measurable $C' \subseteq C$, it holds

$\int_{C'} |f(x) - f(y)|\,d\mu(x) \le s + c \cdot \mathrm{diam}(C) \cdot \mu\{x \in C' : |f(x) - f(y)| > s\}$.

d) Let $\mu$ either be the normalized Lebesgue measure on C or the normalized integer counting
measure on C, i.e. $\mu(C') = \mathrm{Card}(C' \cap \mathbb{Z}^d) / \mathrm{Card}(C \cap \mathbb{Z}^d)$. Then it holds

$\int_C f(x)^2\,d\mu(x) \le O(\sqrt{\mathrm{diam}\,C}) \cdot \left(\int_C |f(x)|\,d\mu(x)\right)^{3/2} + O(c^2)$.

Concerning d) remember that, without Lipschitz condition, $\int f(x)^2\,dx$ in general cannot be
bounded in terms of $\int |f(x)|\,dx$: consider $x \mapsto 1/\sqrt{x}$. Also, $\sqrt{\mathrm{diam}\,C}$ is asymptotically best
possible in 1D, since the 1-Lipschitz function on $[0,n]$ depicted in Figure [2]b) has, for the
normalized Lebesgue measure $d\mu = dx/n$, $\int |f|\,d\mu = \Theta(1)$ and $\int |f|^2\,d\mu = \Theta(n^{1/2})$. Similarly,
the sample size of Lemma [3]b) is asymptotically best possible for the function depicted in
Figure [2]a).

Fig. 2. a) A sample of size $\Omega$(cube volume/Lipschitz constant) is generally necessary in order
to get uniform approximation of an unknown f. b) 1-Lipschitz function on $[0,n]$ with
$\int |f| = \Theta(n)$ and $\int f^2 = \Theta(n^{3/2})$.



Proof (Lemma [3]). a) Since f is c-Lipschitz, $\|x - y\| < s/c$ implies $|f(x) - f(y)| < s$.

b) Let $z := (\sup f + \inf f)/2$ and $y_+, y_- \in C$ with $f(y_+) = \sup f$ and $f(y_-) = \inf f$, exploiting
that f is continuous on compact C. Then $C_+ := \{x \in C : f(x) > z + s\} \supseteq \{x :
|f(x) - f(y_+)| < s\}$ and $C_- := \{x : f(x) < z - s\}$ both have $\mu(C_+), \mu(C_-) \ge A$ by a).
Hence 1/A points x sampled from C have probability $1 - (1 - A)^{1/A} \ge 1 - 1/e$ of hitting
some $x_i \in C_+$; similarly for hitting some $x_j \in C_-$: and since $C_+ \cap C_- = \emptyset$ are independent,
a constant probability $\ge (1 - 1/e)^2$ for both. These satisfy $|f(x_i) - f(x_j)| > s$.

c) Observe that the probability measure $\mu$ satisfies $\mu\{x \in C' : |f(x) - f(y)| \le s\} \le \mu(C) = 1$ and

$\int_{C'} |f(x) - f(y)|\,d\mu(x) \le \int_{x \in C' : |f(x) - f(y)| \le s} s\,d\mu(x) + \int_{x \in C' : |f(x) - f(y)| > s} c \cdot \|x - y\|\,d\mu(x)$.
d) W.l.o.g. suppose $f : [0, N) \to \mathbb{R}$ is nonnegative and 1-Lipschitz; otherwise consider $|f|/c$.
Consider the real sequence y defined by $y_0 := \max\{1, f(0)\}$, $y_1 := \max\{1, f(y_0)\}$,
$y_2 := \max\{1, f(y_0 + y_1)\}$, ..., $y_{i+1} := \max\{1, f(y_0 + y_1 + \cdots + y_i)\}$. Since $y_i \ge 1$, it holds
$\sum_{i \le n} y_i \ge N$ for some $n \le N$; truncate $y_n$ such that $\sum_{i \le n} y_i = N$. Moreover let
$I \subseteq \{1, \ldots, n\}$ denote the set of those i with $y_i > 1$ and in particular $y_i = f(\sum_{j<i} y_j)$.
Because of the Lipschitz condition, it holds

$y_i - x \le f(\sum_{j<i} y_j + x) \overset{(*)}{\le} y_i + x$ for all $0 \le x \le y_i$ and all $i \in I$; (*) even for all i.

Thus, $\int_0^{y_i} f(\sum_{j<i} y_j + x)\,dx \ge y_i^2/2$ for $i \in I$ and $\int_0^{y_i} f(\sum_{j<i} y_j + x)^2\,dx \le y_i^3 \cdot 7/3$
for all i; in particular $\le 7/3$ for $i \notin I$. Therefore,



\f\dx = l/N-Y, / 
'0 i ■'T.j<^y] ia iei 

with the abbreviation y := {yi)-^^. Similarly, 

1/iV • / dx<l/N-Y, " I/I' + 1/iV • 5^ 7/3 

iel ■^^j<^y^ i0 

<|ll2/lli/lll/lli + 7/3 <^ i^-||l/||2/lll/ll? + 7/3 

where at (**) we have employed Jensen's Inequality $\|y\|_3 \le \|y\|_2$ as well as the bound
$\|(y_1, \ldots, y_k)\|_1 \le \sqrt{k} \cdot \|(y_1, \ldots, y_k)\|_2$. □



3.2 Sample Sizes for Uniform and for Residual Sum-of-Squares Approximation

First note that Algorithm [1] terminates for c-Lipschitz continuous functions $f : C \to \mathbb{R}$: If
$c \cdot \mathrm{diam}(C) \le s$ holds, then line 3 strikes; and this happens at the latest at a recursion depth
of order $\log(c \cdot \mathrm{diam}(C)/s)$.
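The depth bound can be made explicit with a short calculation. The notation $C_\ell$ for a subcuboid at recursion depth $\ell$ is introduced here for illustration; the calculation only uses that each cut halves every side length and hence the diameter.

```latex
\begin{align*}
  \operatorname{diam}(C_\ell) &= 2^{-\ell}\,\operatorname{diam}(C), \\
  \sup_{x,y \in C_\ell} |f(x) - f(y)| &\le c \cdot \operatorname{diam}(C_\ell)
      = c \cdot 2^{-\ell}\,\operatorname{diam}(C), \\
  c \cdot 2^{-\ell}\,\operatorname{diam}(C) \le s
      &\iff \ell \ge \log_2 \frac{c \cdot \operatorname{diam}(C)}{s}.
\end{align*}
```

So from depth $\lceil \log_2(c \cdot \operatorname{diam}(C)/s) \rceil$ on, any k sampled values differ pairwise by at most s, and the third line of Algorithm [1] replaces f by a constant regardless of the random samples.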

Theorem 4. Let $\mu$ denote either the normalized Lebesgue measure or the normalized integer
counting measure on C.

i) Consider $k' := \mathrm{vol}(C) \cdot c^d/s^d$ with $\mathrm{vol}(\times_i [a_i, b_i]) := \prod_{i=1}^{d} (b_i - a_i)$. Then Algorithm [1] with
$k := k' \cdot \log^2(k')$ produces g such that $\|f - g\|_\infty \le 4s$ holds with high probability;

ii) Suppose $s > c$ and consider $k' := \sqrt{\mathrm{diam}\,C} + c \cdot \mathrm{diam}(C)/s$. Then Algorithm [1] with
$k := k' \cdot \log^2(k')$ produces g such that $\|f - g\|_2 \le 4s$ holds with high probability.

We remark that the uniform approximation in i) is based on samples of size k, essentially linear
in vol(C); whereas the least squares approximation in ii) takes samples of size proportional
to the diameter, which is asymptotically smaller in dimensions $d \ge 2$. Note that a sample
size of order $\mathrm{diam}^{1-\epsilon}/s$ is necessary in the worst case:

Example 5. Fix $\epsilon > 0$ and generalize the function in Figure [2]b) to the 1-Lipschitz
$f : [0, n] \to \mathbb{R}$ defined by

$f(x) := n^{1-2\epsilon} - x$ for $x \le m := n^{1-2\epsilon} + a \cdot n^{1-4\epsilon}$, $f(x) := -a \cdot n^{1-4\epsilon}$ for $x > m$.

Then $\int_0^{n^{1-2\epsilon}} f(x)\,dx = n^{2-4\epsilon}/2$, $\int_{n^{1-2\epsilon}}^{m} f(x)\,dx = -a^2 \cdot n^{2-8\epsilon}/2$, and
$\int_m^n f(x)\,dx = -(n - m) \cdot a \cdot n^{1-4\epsilon}$; hence the mean is $\int f(x)\,d\mu(x) = 0$ for some appropriate
$a = \tfrac{1}{2} + o(1)$. Moreover $\int_0^{n^{1-2\epsilon}} |f(x)|^2\,dx = n^{3-6\epsilon}/3$ and $\int_{n^{1-2\epsilon}}^{n} |f(x)|^2\,dx = O(n^{3-8\epsilon})$,
hence the variance is $\sigma = \sqrt{1/n \cdot \int_0^n |f|^2\,dx} = \Theta(n^{1-3\epsilon})$. Now consider $s := \sigma/2$ and observe
that, in order to approximate f up to error s, any algorithm has to distinguish it from some
other function $\tilde{f} : [0, n] \to [-s/2, +s/2]$ and therefore needs to detect $x_i, x_j$ with
$f(x_i) - f(x_j) > s$. Since $f(x) \in [-\Theta(n^{1-4\epsilon}), 0] \subseteq [-s/2, +s/2]$ for $x \ge n^{1-2\epsilon}$ and n sufficiently
large, such an algorithm must in particular (yet does not suffice to) find at least one
$x_i < n^{1-2\epsilon}$; which for one single sample happens with probability $n^{1-2\epsilon}/n = \Theta(s/n^{1-\epsilon})$
where $n = \mathrm{diam}[0, n]$. □

Proof (Theorem [4]).

i) First observe that in case $\sup_C f - \inf_C f \le 4s$, any sampled $x \in C$ will yield $g := f(x)$
with $\|f - g\|_\infty \le 4s$; whereas in case $\sup_C f - \inf_C f > 4s$, Algorithm [1] has a high
probability of cutting C into smaller parts: this follows from Lemma [3]b) by probability
amplification due to the logarithmic oversampling. Since the algorithm has logarithmic
recursion depth with sub-subcuboids of exponentially fast decreasing size before arriving
at constant volume, the $\log^2$-factor maintains the high success probability throughout.

ii) The mean $z := \int_C f(x)\,d\mu(x)$ is well-known to minimize $\int_C |f(x) - z|^2\,d\mu(x) = \|f - z\|_2^2$
(or, equivalently, $\|f - z\|_2$). Observe that $C_+ := \{x \in C : f(x) > z\}$ and $C_- := \{x \in C :
f(x) \le z\}$ satisfy $1 = \mu(C_-) + \mu(C_+)$ and

$0 = \int_C (f(x) - z)\,d\mu(x) = \int_{C_+} |f(x) - z|\,d\mu(x) - \int_{C_-} |f(x) - z|\,d\mu(x)$,

$\int_C |f(x) - z|\,d\mu(x) = \int_{C_+} |f(x) - z|\,d\mu(x) + \int_{C_-} |f(x) - z|\,d\mu(x)$;

hence $\int_{C_-} |f(x) - z|\,d\mu(x) = \tfrac{1}{2} \int_C |f(x) - z|\,d\mu(x)$. W.l.o.g. $\mu(C_+) \ge 1/2$.
Now first suppose s > \\\f — z||2/v^diamC. Recall Bernstein's Inequality 



ks' 

< 2 ■ exp 



2(72 + (5 



a • s 



for k independent random variables $X_i \in [a, b]$ with mean z and variance $\sigma = \|f - z\|_2$.
Here $X_i := f(x_i) \in [z \pm c \cdot \mathrm{diam}\,C]$ because of the Lipschitz condition. Hence
$k \ge 32 \cdot \sqrt{\mathrm{diam}\,C} + c \cdot \mathrm{diam}(C)/s$ samples suffice for the sample average in line 4 of
Algorithm [1] to be closer than s to the true mean z with constant probability.
This time suppose s > |||/ — z\\2/ v^diamC. Then, by LemmafSh), 



H{{x £ C : f{x) < z - s}) = n{{x £ C- : \f{x) - z\ > s}) > 
> i^j \f{x)- z\dx - s^/ic-diamC) = 

where, according to Lemma [3]1) and in the big-Oh sense, 



ll/-^lli/2- 
c • diam C 



\\f-z\\l > ^\\f-zg-CyVdi^ > (||/-z||2-c)/^di^ C > 4s — c/v^ diamC , 

hence $\mu(\{x \in C : f(x) < z - s\}) \ge (3s - c)/(c \cdot \mathrm{diam}\,C)$. So a sample of size
$k \ge (c \cdot \mathrm{diam}\,C)/(3s - c)$ has constant probability for Algorithm [1] to find some $x_i \in C$ with
$f(x_i) < z - s$ and some $x_j \in C_+$, i.e. to proceed to its fourth line. □



4 Empirical Evaluation and Comparison of Occlusion Culling Algorithms 

Our approach based on hierarchical subdivision introduces a powerful alternative or addition 
to the standard way of evaluating a target function (e.g. the running time) of a rendering al- 
gorithm along some (hopefully carefully chosen) camera path or at certain observer positions. 

In order to determine whether or not this algorithm is appropriate for a specific applica- 
tion, it is necessary to evaluate the algorithm with respect to different requirements. One such 
requirement for an occlusion culling algorithm could be for example that it actually identi- 
fies and rejects a sufficiently large part of the hidden objects during the rendering process. 
More relevantly: Does the algorithm increase the overall frame rate of the scene on the target
system compared to a simpler algorithm without occlusion culling, taking into account the
overhead introduced by the occlusion tests?




Fig. 3. Example scene consisting of 625 objects (6.2M polygons)

Fig. 4. Subdivision according to the number of visible objects (resolution 256 x 256 x 1,
splitting threshold 40)

Fig. 5. Subdivision according to the number of objects classified as visible (resolution
256 x 256 x 1, splitting threshold 80)



Figure [3] shows the scene we use for the following examples. It consists of 625 objects (of
6.2 million triangles altogether), mostly trees (which individually produce only little occlusion)
and some opaque walls. To inspect the functionality of our chosen occlusion culling algorithm
in this setting, we construct the hierarchical subdivision with respect to the function "number
of visible objects". Visibility is determined by projecting (rendering) the scene onto the sides
of a cube around the current position for which the (visibility) function is evaluated. An
object is considered visible if it contributes at least one pixel to the rendered image of one of
the six sides of the cube.

In our implementation of the sampling approach, we restrict the sample space to a discrete 
set of points in 3d-space arranged on a grid, so that the maximum size of the subdivision 
is bounded by the resolution of the grid (the smallest cell of a subdivision is one cell of the 
grid). This resolution can be chosen as one parameter to adapt the resulting subdivision to 
the needs of the intended application. The sample size is set proportional to the diameter
of the current area in grid points (with factor 0.5).

The created subdivision with a grid resolution of 256 x 256 x 1 (in this example we inspect
only one layer) is shown in Figure [4]; the areas where only few objects are visible are colored red
and the areas where almost all objects are visible are colored green. The three dimensional
visualization of the data gives a very intuitive impression of the actual distribution of the 
scene's visibility function. To test how many of the hidden objects of this scene our occlusion 
culling algorithm identifies, we create an additional subdivision according to the number of 
objects the algorithm classifies as visible. The value of the function is measured by executing 
the rendering algorithm and counting the objects that pass the occlusion test on at least one 
side of the cube. 

Just by comparing the resulting image (Figure [5]) to the previous subdivision according
to the visibility, one can see that in areas where many objects are occluded, the number of
rendered objects is indeed smaller; but the number of rendered objects is in general larger
than the number of visible objects. This can be further analyzed by creating an additional
subdivision of the difference between the rendered and the visible objects (which can easily be
calculated from the existing subdivisions). This difference subdivision indicates the number
of unnecessarily rendered objects.
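One simple way to compute such a difference, sketched below, is to rasterize both subdivisions onto the common sampling grid and subtract cell by cell. The leaf format `((x0, y0, x1, y1), value)` with half-open integer grid ranges and the function names are assumptions made for this illustration, not the paper's data structure.

```python
from typing import List, Tuple

GridLeaf = Tuple[Tuple[int, int, int, int], float]  # ((x0, y0, x1, y1), value)


def rasterize(leaves: List[GridLeaf], width: int, height: int) -> List[List[float]]:
    """Write each leaf's constant value into all grid cells it covers."""
    grid = [[0.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), value in leaves:
        for y in range(y0, y1):
            for x in range(x0, x1):
                grid[y][x] = value
    return grid


def difference(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Cell-wise difference a - b of two rasterized subdivisions."""
    return [[va - vb for va, vb in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
```

Calling `difference(rendered_grid, visible_grid)` then gives, per grid cell, the predicted number of objects that were rendered although they are not visible.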

The question arises whether the number of culled objects is large enough, relative to the
computational overhead introduced by the tests, to increase the overall frame rate during
rendering. Therefore, we compare the subdivisions according to the running time of the two
algorithms (simple rendering and occlusion culling). The function is evaluated by measuring
the rendering time at the given position for the six directions of the surrounding cube and
then taking the maximum of these values (in our experiments, this turned out to identify the
most relevant value). The difference subdivision of these two subdivisions (see Figure
[6]) shows in blue the areas where the occlusion culling algorithm outperforms the simple
rendering. Inside these areas, the number of occluded (and identified as such) objects is high
enough to compensate for the additional costs of the tests. But if the goal is to minimize the
average rendering time for all positions, then in this configuration simple rendering without
occlusion culling should be preferred (average rendering time for all positions: 20.3 ms with
occlusion culling, 19.7 ms without).
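The local and the global comparison can be sketched as follows. The function names are hypothetical, and the grids are assumed to hold predicted per-position frame times of the two renderers on the same sampling grid.

```python
from typing import List, Sequence


def frame_time(times_per_direction: Sequence[float]) -> float:
    """Value at one position: the slowest of the six cube directions."""
    return max(times_per_direction)


def mean(grid: List[List[float]]) -> float:
    """Average over all grid positions (uniform weighting)."""
    cells = [value for row in grid for value in row]
    return sum(cells) / len(cells)


def pays_off_mask(culling: List[List[float]],
                  brute: List[List[float]]) -> List[List[bool]]:
    """True where occlusion culling is predicted to be faster than brute force."""
    return [[c < b for c, b in zip(row_c, row_b)]
            for row_c, row_b in zip(culling, brute)]
```

The mask answers the local question (where does culling pay off?), while comparing `mean(culling)` with `mean(brute)` answers the global one; as the measurements above show, the two answers need not agree.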



Test system: Intel(R) Core(TM)2 CPU 6600 (2x2.4 GHz), 2GB RAM, VGA: ATI Radeon HD 2600 XT with
512MB RAM

Fig. 6. Area (in blue) where on the test system occlusion culling pays off, other areas are not
displayed; resolution 256 x 256 x 8; the scene is organized in an octree with depth 5.

Fig. 7. Average running times according to different octree depths.

5 Parameter Tuning: Maximum Octree Depth

The occlusion culling algorithm used does not perform a test for every single object, but tests
the visibility of all objects that are contained in an octree node simultaneously. This reduces
the number of tests but increases the number of occluded objects which are rendered by
mistake. The number of objects per node can be controlled (in our implementation) by the
maximum depth of the octree, which can be set as a parameter in the preprocessing.
High values result in a deep octree with only few objects per node and in a good identification
of visible objects. Lower values on the other hand result in more errors during the tests, but
also reduce the number of necessary tests.

This is one example of a parameter that can be adapted to the scene and the hardware used
with the help of valued subdivisions. We define the optimum as the value for which the
average rendering time is minimized over the area of interest (another definition could, for
example, require that the area in which a certain rendering time is exceeded be minimized).
The area for which the subdivision is calculated should correspond to the intended
application. If the scene is used in a walk-through application in which the observer mainly
stays near the ground, the subdivision should be created only for this area. For a flight
simulation, the efficiency of the algorithm above the ground may lead to a different
solution, as only few objects are occluded when seen from above.

In order to determine the best value for walking through the example scene, we create
multiple subdivisions according to the running time of the rendering algorithm with different
values and calculate the average of each. Figure 7 shows the average values for the different
octree depths; a depth of four results in the lowest average rendering time and is therefore
the best choice for the given setting according to our definition.
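The selection rule can be written down directly. The following is a minimal sketch under stated assumptions: the per-depth averages are invented for illustration (they are not the values plotted in Figure 7), and the function names are hypothetical. The second function illustrates the alternative definition of the optimum mentioned above, based on the share of cells exceeding a given rendering time.

```python
def best_octree_depth(avg_ms_by_depth):
    """Return the depth whose average rendering time is smallest."""
    if not avg_ms_by_depth:
        raise ValueError("no measurements given")
    return min(avg_ms_by_depth, key=avg_ms_by_depth.get)


def fraction_exceeding(cell_times_ms, limit_ms):
    """Share of cells whose rendering time exceeds the limit."""
    if not cell_times_ms:
        raise ValueError("no cells given")
    return sum(1 for t in cell_times_ms if t > limit_ms) / len(cell_times_ms)


# Hypothetical averages (ms) per maximum octree depth; not taken from Figure 7.
example_averages = {2: 31.0, 3: 24.5, 4: 19.8, 5: 20.3, 6: 22.9}
chosen = best_octree_depth(example_averages)

# Hypothetical per-cell times (ms) for one depth, checked against a 25 ms budget.
over_budget = fraction_exceeding([12.0, 18.0, 26.0, 40.0], limit_ms=25.0)
```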

6 Rendering- and Culling-Time Prediction 

Subdivisions according to the actual runtime are a very practical tool for tuning an
algorithm's parameters for specific hardware, but they give little insight into the internal
characteristics and bottlenecks of the analyzed algorithm. If the dependency between the
efficiency of the rendering algorithm and the properties of the hardware and the scene is
understood, the hardware can be chosen to fit the needs (e.g. in industrial applications
like real-time simulations), or the input (the scene) can be adapted to the capabilities of
particular hardware (e.g. when designing scenes for computer games). In order to create a
runtime prediction for generic hardware, the first step is to identify the dominant
operations of the algorithm and to formulate a function that predicts the running time
depending on these operations (which is not a trivial task!). Then we create subdivisions
according to the number of operations the algorithm performs. At this point a (hopefully
negligible) error can be introduced, as the number of operations may also depend on the
actual hardware (especially when the algorithm heavily exploits parallelism). The
subdivisions according to the number of operations can then be combined with the actual costs
on a specific system according to the function for predicting the runtime. This results in a
new subdivision according to the predicted running time for the test system. (This process
will probably have to be repeated a few times until the model fits the algorithm.)

As an example we introduce a simplified runtime prediction for the occlusion culling
algorithm used here:

$$\mathrm{cost}_{\mathrm{total}}(pos) = \mathrm{cost}_{\mathrm{poly}} \cdot \mathrm{numPolygons}(pos) + \mathrm{cost}_{\mathrm{occlTest}} \cdot \mathrm{numOcclTests}(pos).$$

We create the corresponding subdivisions for the two operation counts (see Figures 8 and 9)
and measure the average costs $\mathrm{cost}_{\mathrm{poly}} = 4 \cdot 10^{-6}$ ms and
$\mathrm{cost}_{\mathrm{occlTest}} = 0.052$ ms on the test system. Although we made some rough
simplifications for this example, experiments showed that the results (see Figure 10) seem
to be a reasonable model for the behavior of the algorithm.
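The linear model is easy to state in code. The following is a minimal sketch, not the authors' implementation; function and parameter names are illustrative. The cost of 0.052 ms per occlusion test is taken from the text. The per-polygon cost is assumed to be 4e-6 ms, because that exponent is the one that reproduces the average reported in the caption of Figure 10 from the averages given for Figures 8 and 9.

```python
COST_POLY_MS = 4e-6         # per rendered polygon; the exponent is an assumption
COST_OCCL_TEST_MS = 0.052   # per occlusion test, as measured on the test system


def predicted_cost_ms(num_polygons, num_occl_tests,
                      cost_poly_ms=COST_POLY_MS,
                      cost_occl_test_ms=COST_OCCL_TEST_MS):
    """Predicted rendering time (ms) for one position under the linear model."""
    return cost_poly_ms * num_polygons + cost_occl_test_ms * num_occl_tests


def predicted_subdivision(polygon_counts, test_counts):
    """Combine two count subdivisions (same cell order) into predicted times."""
    if len(polygon_counts) != len(test_counts):
        raise ValueError("subdivisions must cover the same cells")
    return [predicted_cost_ms(p, t) for p, t in zip(polygon_counts, test_counts)]


# Averages from the captions of Figures 8 and 9: 3.4M polygons, 265 tests.
average_prediction = predicted_cost_ms(3.4e6, 265)
```

Because the model is linear, the average predicted time equals the model applied to the average counts. With the constants above this gives about 27.4 ms, close to the average of 27 ms reported for Figure 10.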





Fig. 8. Subdivision according to number of occlusion tests; min (black) 20, avg 265, max
(blue) 362.

Fig. 9. Subdivision according to number of rendered polygons; min (black) 84, avg 3.4M, max
(green) 5.7M.

Fig. 10. Subdivision according to expected rendering time; c_poly = 4 * 10^-6 ms, c_test =
0.052 ms; min (green) 1.5 ms, avg 27 ms, max (red) 40.4 ms.

Fig. 11. Part of a subdivision with viewing directions; blue: occlusion culling is faster,
red: simple rendering is faster.



7 Automatically and Adaptively Selecting Culling Algorithms 

We showed several possibilities to evaluate, select and adapt a rendering algorithm in the
preprocessing, but it is also possible to use the additional information contained in a
subdivision online, during the walk-through. As we have seen in Figure 6, even in our simple
example it depends on the position of the observer which rendering algorithm is better (in
our case that means faster, but it could also mean fewer approximation errors, more details,
etc.). If the behavior of multiple algorithms can be estimated by different subdivisions and
the rendering algorithms can be switched at runtime (for which it may be necessary that all
algorithms work with the same data structure the scene is organized in), then we can always
select the best algorithm according to the predictions. As long as only the current position
of the observer is considered, this can be achieved with the presented techniques. But
besides the position, the viewing direction of the observer has an important influence on the
algorithm (although this is also true for the other applications, there it is mostly
sufficient to acquire only one value per position in order to evaluate the general or average
performance). To cover this additional information, one option is to extend the domain of the
sampling by two additional dimensions for the viewing direction. In our implementation we use
another approach. Its main idea is to extend the dimension of the target function to six (one
value for each side of the surrounding cube). For the decision whether the current area has
to be split up during the sampling process, we check the difference of each single dimension
to the average value of this dimension independently. If the difference in one dimension is
larger than the splitting threshold, the area is subdivided.
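The per-dimension splitting test can be illustrated as follows. This is a sketch under assumptions: the text does not specify how the samples of an area are drawn or how the difference is aggregated, so the code assumes that a list of already measured samples is given and that an area is split as soon as any sample deviates from the average of its dimension by more than the threshold. The name `needs_split` is hypothetical.

```python
def needs_split(samples, threshold):
    """Decide whether an area must be subdivided.

    samples: list of 6-tuples, one value per side of the surrounding cube.
    The area is split if, in any single dimension, some sample deviates
    from that dimension's average by more than the threshold.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    for dim in range(6):
        column = [s[dim] for s in samples]
        average = sum(column) / len(column)
        if any(abs(v - average) > threshold for v in column):
            return True
    return False


# Hypothetical samples (ms): nearly uniform in every direction.
homogeneous = [(10, 10, 10, 10, 10, 10), (11, 10, 9, 10, 10, 10)]
# Hypothetical samples (ms): uniform except for the sixth direction.
one_sided = [(10, 10, 10, 10, 10, 10), (10, 10, 10, 10, 10, 30)]

split_homogeneous = needs_split(homogeneous, threshold=2)
split_one_sided = needs_split(one_sided, threshold=2)
```

Checking each dimension on its own means that a strong variation in a single viewing direction triggers a split even if the other five directions are uniform.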

During the walk-through, the values for the sides are interpolated according to the projected
sizes of the sides on the current viewing rectangle. Figure 11 shows a part of the
visualization of the difference subdivision between the runtime subdivisions with and without
occlusion culling with the viewing direction extension (for which it is quite challenging to
find an acceptable visualization). When, during the walk-through, the observer moves through
a cube and mostly sees blue sides, occlusion culling is probably more efficient than simple
rendering. When she mainly sees red sides of the cube, occlusion culling does not pay off and
simple rendering is likely to be faster.
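The online selection can be sketched as a weighted average. This is an illustration only: how the projected sizes are computed is not specified in the text, so they are treated here as given inputs, and weighting the side values linearly by these sizes is an assumption. The function names and the example numbers are hypothetical.

```python
def interpolate_sides(side_values, projected_areas):
    """Weighted average of the six side values, weighted by projected size."""
    if len(side_values) != 6 or len(projected_areas) != 6:
        raise ValueError("expected six side values and six projected areas")
    total = sum(projected_areas)
    if total <= 0:
        raise ValueError("at least one side must be visible")
    return sum(v * a for v, a in zip(side_values, projected_areas)) / total


def choose_renderer(diff_sides_ms, projected_areas):
    """diff_sides_ms holds (time with culling - time without) per cube side."""
    diff = interpolate_sides(diff_sides_ms, projected_areas)
    return "occlusion_culling" if diff < 0 else "simple"


# Hypothetical difference values (ms) for the six sides of one cell.
diff_sides = [-4.0, 2.0, 1.0, 1.0, 0.5, 0.5]

# Observer mostly looks at the first side, where culling is faster.
looking_at_first = choose_renderer(diff_sides, [0.7, 0.3, 0.0, 0.0, 0.0, 0.0])
# Observer mostly looks at the second side, where simple rendering is faster.
looking_at_second = choose_renderer(diff_sides, [0.1, 0.9, 0.0, 0.0, 0.0, 0.0])
```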



8 Subdivision Quality 



When creating a subdivision, there are different parameters for adapting the result to the
requirements of the intended application. When the aim is to evaluate an algorithm in a more
comprehensive way than simply using a camera path, a very fine-grained underlying grid and a
distinction between different viewing directions may not be important, but it may be helpful
to get an intuitive visualization of the data and reliable average values. If a subdivision
is used to select the best algorithm at runtime, the demands on the accuracy are higher. Any
decision should be correct (or produce only slight errors) at almost all observer positions.
Therefore a high resolution of the sampling grid, a low splitting threshold and viewing
direction dependency are necessary.

Figure 12 shows the number of distinct samples needed for the calculation of a subdivision
(every function value is cached internally, so that the running time is mainly determined by
the number of distinct values). The time for creating this example subdivision ranges from
over 30 minutes (splitting threshold 10) to 20 seconds (splitting threshold 500). Figure 13
shows the development of the average value and of the error relative to the value calculated
for every grid point. In our experiments even a high splitting threshold leads to a good
average value. If a low average error is important at every position, a lower threshold has
to be chosen (procedure: first choose a high splitting threshold, then lower it iteratively
until the result is satisfactory).
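The caching and the iterative procedure can be sketched as follows. This is a hedged illustration rather than the implementation used here: `build_error` stands for a caller-supplied (hypothetical) routine that builds a subdivision for a given threshold and returns its error, halving the threshold in each step is an arbitrary choice, and the linear error model in the example is invented purely to make the sketch runnable.

```python
from functools import lru_cache


def tune_threshold(build_error, start, max_error, factor=0.5, min_threshold=1.0):
    """Lower the splitting threshold until the subdivision error is acceptable.

    build_error(threshold) is a caller-supplied (hypothetical) function that
    builds a subdivision with the given threshold and returns its error.
    """
    if not 0 < factor < 1 or min_threshold <= 0:
        raise ValueError("need 0 < factor < 1 and min_threshold > 0")
    threshold = start
    while True:
        error = build_error(threshold)
        if error <= max_error or threshold * factor < min_threshold:
            return threshold, error
        threshold *= factor


calls = []


@lru_cache(maxsize=None)
def cached_sample(x, y, z):
    """Stand-in for an expensive measurement; each grid point is evaluated once."""
    calls.append((x, y, z))
    return float(x + y + z)


# Toy error model for illustration only: error grows linearly with the threshold.
threshold, error = tune_threshold(lambda t: 0.01 * t, start=500, max_error=1.0)

cached_sample(1, 2, 3)
cached_sample(1, 2, 3)  # served from the cache, no second measurement
```

Starting high keeps the first runs cheap, and because sampled values are cached, a later run with a lower threshold only pays for grid points that have not been measured before.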
Fig. 12. Number of distinct samples needed for creating a subdivision according to visibility
with a grid resolution of 64 x 64 x 8.

Fig. 13. Average value of the subdivision and average error compared to the exact value of
each grid cell as a function of the splitting threshold.



9 Conclusion and Perspectives 

We have shown the benefit of our approach for automatically determining whether and when a
specific hardware-based occlusion culling algorithm yields a net benefit over a brute-force
renderer. In the future we will extend this method from this on/off problem towards a finer
tuning: passing to the culling algorithm a parameter that controls how carefully it filters
(i.e. how much computational effort it spends on finding) occluded objects, in order to
yield the best net performance. Algorithm 1 approximates (and succeeds with high probability
on) an unknown Lipschitz-continuous function f by replacing it with a piecewise constant
function g. Inspired by the works [Coop95, Beli06], it seems promising to generalize our
approach and use a piecewise linear g.